UBC Botanical Garden is advancing action for adaptation through the development of community-based adaptation plans with the Sustainable Communities Field School. Ensuring the long-term preservation and sustainable utilization of apple genetic diversity across Canada is of paramount importance. This collaborative initiative aims to inventory and understand existing apple genetic diversity within the country. Additionally, it seeks to investigate how the suitability of habitats for apples may be affected by changing climatic conditions and emerging challenges. In response to these potential impacts, the project aims to identify urgent and longer-term adaptation strategies necessary for the continued resilience of apple populations in Canada.
This work is associated with an in-progress manuscript on Malus crop wild relatives (CWR) native to Canada, including M. coronaria (Sweet Crabapple) and M. fusca (Pacific Crabapple).
The following analysis was written in R, a free programming language and software environment primarily used for statistical computing and graphics. Visit https://www.r-project.org/ to get started with R. We also recommend RStudio, an integrated development environment (IDE) for R that provides a user-friendly interface for writing code, viewing plots and data, managing files, and installing packages. Once R is installed, visit https://www.rstudio.com/products/rstudio/download/ to download RStudio.
The species distribution modeling in the following analysis also requires that you have Java (another free programming language) installed. Visit https://www.oracle.com/ca-en/java/technologies/downloads/ and install the Java Development Kit (JDK).
Please note: The following is a reproducible example, not an introductory tutorial on SDM. The markdown is written for someone with intermediate to advanced experience in R.
The following is a reproducible workflow for developing species distribution models (SDMs) using presence/background occurrence data for the crabapple species M. coronaria and M. fusca. The data are explained in more detail below. Occurrence data are downloaded from the open-access Global Biodiversity Information Facility (GBIF). There are several types of SDMs; in this workflow we use the widely adopted MaxEnt (Maximum Entropy), a machine learning algorithm that can handle presence/background data. The outputs of this analysis are models of predicted habitat suitability for fundamental niches, built from environmental (climate) data. These models predict suitable habitat from historical (1970s-2000s) climate through future projected climate conditions under two Shared Socioeconomic Pathways (SSP2-4.5 and SSP5-8.5). This is essential for understanding how these species may respond to changes in climate, so that they can be better managed, sampled, or conserved to preserve their genetic diversity for future generations.
See the following for an introduction and description of MaxEnt: Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. Diversity and Distributions, 17(1), 43-57. https://doi.org/10.1111/j.1472-4642.2010.00725.x.
Here is a brief overview of the steps and associated scripts for each part of the analysis.
| Step | Script | Purpose |
|---|---|---|
| 1 | gbif_occ.R | Prepare data request and download from GBIF |
| 2 | occ_clean.R | Clean occurrence (presence) data points |
| 3 | occ_thin.R | Thin occurrence points to a single observation per predictor (raster) cell |
| 4 | malus_bg.R | Download ecoregions, sample background points, predictor clustering, 3D kernel density plots |
| 5 | malus_sdm.R | Split training/testing data, MaxEnt models, habitat predictions |
| 6 | sdm_plot.R | Produce publication-quality plots of SDM predictions |
| 7 | malus_MESS.R | Multivariate Environmental Similarity Surfaces; assess novel (dissimilar) environments |
The Global Biodiversity Information Facility (GBIF) is a valuable resource for SDM due to its extensive collection of biodiversity data from around the world. There are several sources of occurrences, such as research and herbarium collections and community science observations. Note that these data are presence-only, meaning there is no information about the absence of species. Also, community science occurrences (e.g., from iNaturalist) are research grade, meaning identifications are expert verified or agreed upon by several community members.
NOTE: You need to have a GBIF user profile, visit https://www.gbif.org/ to sign up!
Load libraries
library(tidyverse) # data management, grammar
library(rgbif) # access GBIF data
Initiate user profile information. Note that I have redacted my own information.
# GBIF user info
user='REDACTED'
pwd='REDACTED'
email='REDACTED'
Navigate to https://www.gbif.org/ and search your taxon of interest to get the taxon key identifier. In this case the taxon keys are: M. coronaria = 3001166 and M. fusca = 3001080.
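If you prefer to look keys up programmatically, the `rgbif::name_backbone()` function matches a scientific name against the GBIF backbone taxonomy (a sketch; requires an internet connection):

```r
library(rgbif) # access GBIF data

# Match a scientific name against the GBIF backbone taxonomy;
# the returned usageKey is the taxonKey used in download requests
key_cor <- name_backbone(name = "Malus coronaria")$usageKey
key_cor # should match the 3001166 key noted above
```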
For the purposes of this workflow I will only show the code for one of the species. Code is just replicated for the other species.
# M. coronaria download
taxonKey <- 3001166
basisOfRecord <- c('PRESERVED_SPECIMEN', 'HUMAN_OBSERVATION', 'OCCURRENCE') # excluded living specimens and material samples (germplasm)
hasCoordinates <- TRUE # limit to records with coordinates
NOTE: We do not limit the year of observation in our data collection. There are reasons why you might; however, for the purposes of this analysis it is reasonable to include all observations, because the main reason historical occurrences (say, pre-1970) may be missing today is more likely habitat destruction than changes in climate affecting realized niches. Furthermore, crabapples are slow growing and migrate even more slowly, so their distributions are likely lagging behind changes in their fundamental niches. This highlights the importance of understanding the ecology of your species of interest.
Prepare a download request for the GBIF occurrence data. There are several formats; I chose .csv because it is familiar and easy to work with.
# Download data
# Use 'pred()' if there is a single argument, or 'pred_in()' if there are multiple
down_code = rgbif::occ_download(
pred("taxonKey", taxonKey),
pred_in("basisOfRecord", basisOfRecord),
pred("hasCoordinate", hasCoordinates),
format = "SIMPLE_CSV",
user=user, pwd=pwd, email=email)
Download and save the GBIF occurrence data as a .csv.
getwd() # check your working directory (wd)
setwd("./occ_data/") # set wd to a location where you want to save the csv file.
download_coronaria <- occ_download_get(down_code[1], overwrite = TRUE)
# extract the csv from the zipped folder and save it as a clearly named csv in Excel or equivalent.
gbif_cor <- read.csv(file = "occ_coronaria.csv") # load named csv data
NOTE: It is good practice to keep this .csv as a ‘raw’ data file, and manipulate it within the R environment in the following workflow rather than make changes to the raw file.
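As an alternative to extracting by hand, rgbif can read the occurrence table straight out of the downloaded archive with `occ_download_import()` (a sketch, reusing the download object from above):

```r
library(rgbif) # access GBIF data

# Read the occurrence table directly from the zip returned by occ_download_get(),
# skipping the manual extract-and-rename step
gbif_cor <- occ_download_import(download_coronaria)

# Optionally write a clearly named 'raw' csv for your records
write.csv(gbif_cor, file = "occ_coronaria.csv", row.names = FALSE)
```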
Now we have downloaded all observations from GBIF for M. coronaria and M. fusca that fit a set of defined criteria suitable for our research question. In this case the criteria are limited to species-level taxonomic resolution, records from preserved herbarium specimens, human observations (e.g., iNaturalist), and research collections documenting species occurrences, all of which are geo-referenced (i.e., have geographic coordinates). Also note that we do not limit the geographic location of occurrences, yet. At the time of downloading this data (02/14/2024), M. coronaria had 1373 records and M. fusca had 2282 records.
As alluded to above, we now need to make some more choices about which species occurrences are appropriate to include and from where. Thus, we must clean the species occurrence data. To do this we are going to visually inspect the occurrences, as well as take advantage of R packages such as CoordinateCleaner to help remove spurious occurrences: replicates, records with poor georeferencing (such as points located at the centre of a country or in a body of water), etc.
Load Libraries
library(tidyverse) # grammar, data management
library(CoordinateCleaner) # helpful functions to clean data
library(terra) # working with vector/raster data
library(geodata) # download basemaps
library(scales) # alpha adjust colours
We begin with loading the saved .csv files of species occurrences from the last script.
# set wd
getwd()
setwd("../occ_data/") # use relative path to open from folder where .csv is saved
# load occurrence csv files
gbif_cor <- read.csv(file = "occ_coronaria.csv") # load coronaria data
gbif_fusca <- read.csv(file = "occ_fusca.csv") # load fusca data
Now we want to initiate a data.frame and select the useful columns from the .csv files. As we build this data.frame, we are also going to use several functions to filter occurrences out of the ‘cleaned’ data set. For more information on the functions used, see the CoordinateCleaner documentation. Also see https://data-blog.gbif.org/post/gbif-filtering-guide/ for a great blog post published by GBIF on post-processing GBIF downloads.
# filter M. coronaria data
occ_cor <- gbif_cor %>%
filter(countryCode %in% c('US', 'CA')) %>% #limit to CA and US
filter(!is.na(decimalLongitude)) %>% # remove records w/o coords
filter(coordinateUncertaintyInMeters < 30000 | is.na(coordinateUncertaintyInMeters)) %>%
cc_cen(buffer = 2000) %>% # remove records within 2km of country centroids
cc_inst(buffer = 2000) %>% # remove records within 2km of herbariums, botanical gardens, and other institutions
cc_sea() %>%
distinct(decimalLatitude, decimalLongitude, speciesKey, datasetKey, .keep_all = T) %>%
filter(decimalLongitude >= -100) %>% # remove some records on west coast, two from bot gardens
filter(!(decimalLatitude < 35 & decimalLongitude < -86)) %>% # remove records from Texas, Oklahoma, Louisiana
filter(!(decimalLatitude > 45)) %>% # remove record from New Brunswick
filter(!(decimalLongitude < -98)) %>% # remove iNat record from south Kansas, northern Kansas record verified by taxonomist
dplyr::select(species, countryCode, decimalLatitude,
decimalLongitude, coordinateUncertaintyInMeters, year, basisOfRecord
)
# Note it is helpful to plot the occurrences below, and then add more conditions to clean inaccurate points
# Pay special attention to points at the edge of the range of occurrences, as these are the most likely to be suspicious
# and to influence the model in strange ways, compared to inaccurate points well within the range of the other occurrences
After filtering the occurrences we are left with 978 occurrences for M. coronaria and 1197 occurrences for M. fusca.
You will notice that I have added a few manual filters in the code above, using the filter function and the ! operator, to remove records that were verified to be inaccurate. When cleaning these inaccurate points, it is useful to start with occurrences at the edges of the species’ range, or any other outlier points. These edge points are the most likely to negatively influence the model if they are not true presences. I personally find a combination of the GBIF map and table views of observations, plus Google Maps, to be a good approach for verifying these questionable points.
NOTE: I am using the pipe operator (%>%) from the tidyverse ecosystem to ease data processing. It sequentially applies each function to the data.frame, so be careful about the order in which you specify functions.
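To see why order matters, compare two pipelines over the same vector (a toy example, unrelated to the occurrence data):

```r
library(magrittr) # provides %>% (also attached with the tidyverse)

x <- c(4, 9, 16)

x %>% sqrt() %>% sum() # sqrt first, then sum: 2 + 3 + 4 = 9
x %>% sum() %>% sqrt() # sum first (29), then sqrt: ~5.39
```

The same functions in a different order give different results, which is exactly what can happen if you reorder the cleaning steps above.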
Now we will save the data.frame object as an .Rdata file. .Rdata files are a convenient way to store/load data within the R environment. We will routinely save intermediate data files to speed up the downstream workflow, so that results can be stored rather than recomputed every time.
getwd() # check what directory you are in
setwd('../occ_data/')
saveRDS(occ_cor, file = "occ_cor.Rdata") # name the file with the .Rdata file extension
Now when revisiting an analysis you can reload the saved cleaned data frame rather than re-running the cleaning functions which are slow computations.
setwd('./occ_data/')
occ_cor <- readRDS(file = "occ_cor.Rdata") # read in the saved .Rdata object
head(occ_cor) # take a peek at your data.frame
## species countryCode decimalLatitude decimalLongitude
## 1 Malus coronaria US 43.08833 -84.72617
## 2 Malus coronaria US 43.61555 -84.24722
## 3 Malus coronaria US 40.89888 -74.70638
## 4 Malus coronaria US 41.83333 -88.08333
## 5 Malus coronaria US 42.97083 -82.42500
## 6 Malus coronaria US 35.58000 -82.55583
## coordinateUncertaintyInMeters year basisOfRecord
## 1 NA 2013 HUMAN_OBSERVATION
## 2 20000 1931 PRESERVED_SPECIMEN
## 3 3 1945 PRESERVED_SPECIMEN
## 4 NA 1894 PRESERVED_SPECIMEN
## 5 5000 1892 PRESERVED_SPECIMEN
## 6 20000 NA PRESERVED_SPECIMEN
Earlier I mentioned that we are not limiting the years in which species occurrences were recorded; however, it is still important to verify that this decision is not adding odd bias to the data. To do this I am going to split the occurrences into three data frames: pre-1970, post-1970, and those without dates. This date is commonly associated with the onset of significant changes in global climate. If occurrences from pre-1970 are vastly different in geography, then we may feel uncomfortable including them in the analysis. Similarly, those without any date should also be verified.
# compare pre/post 1970 and nd occurrences
occ_cor_pre <- occ_cor %>% # pre 1970
filter(year < 1970)
occ_cor_post <- occ_cor %>% # post 1970
filter(year >= 1970)
occ_cor_nd <- occ_cor %>%
filter(year %in% NA)
The best way to compare these pre/post/no-date occurrences is by visualizing them to detect any differences in spatial coordinates. To do this we need the terra and geodata packages. The terra package is useful for working with raster data and will frequently be used throughout this workflow. The geodata package will also be used frequently, and allows easy download of geographic boundaries from the Global Administrative Areas database (GADM).
For now plotting will be done using base R graphics and terra, final publication quality plots will be done later using ggplot2.
First download the basemaps, and put them in a location to access them again later to avoid having to redownload them. These files are large, so consider not pushing them to GitHub if you are working in a repo.
setwd('../occ_data/')
# download/load maps
us_map <- gadm(country = 'USA', level = 1, resolution = 2,
path = "../occ_data/base_maps") #USA basemap w. States
ca_map <- gadm(country = 'CA', level = 1, resolution = 2,
path = '../occ_data/base_maps') #Canada basemap w. Provinces
canUS_map <- rbind(us_map, ca_map) #combine US and Canada vector map
Now plot the occurrence points. Remember that you might have to go back and add additional filters to the code above like I have for example by removing records from Texas, Oklahoma, Louisiana, and New Brunswick.
# M coronaria
plot(canUS_map, xlim = c(-100, -60), ylim = c(25, 50))
# plot Malus coronaria occurrences
# pre-1970
points(occ_cor_pre$decimalLongitude, occ_cor_pre$decimalLatitude, pch = 16,
col = alpha("red", 0.2))
# post-1970
points(occ_cor_post$decimalLongitude, occ_cor_post$decimalLatitude, pch = 16,
col = alpha("blue", 0.2))
# No date
points(occ_cor_nd$decimalLongitude, occ_cor_nd$decimalLatitude, pch = 16,
col = alpha('black', 0.2))
legend(x= -75,
y = 33,
title = 'M. coronaria',
legend = c('Pre-1970 (n=412)', 'Post-1970 (n=514)', 'No Date (n=53)'),
fill = c('red', 'blue', 'black'))
dev.off() # close graphics plot window
NOTE: Plotting geographic coordinates can be tricky, and often you will have to play around with the plot limits and legend location to get it to something that is visually appealing.
Inspecting the plot of M. coronaria occurrences, there is a large amount of overlap between pre- and post-1970 occurrences, so we feel comfortable keeping all the data in the analysis regardless of year. It is important that you evaluate this for your own species and ask if it fits your research question.
After visually inspecting the occurrence locations, verifying that there are no duplicates, and no occurrences from strange locations (such as records from the west coast, when dealing with an east coast species) you can save your occurrence data.frame (e.g. occ_cor) for downstream analysis.
So we have downloaded and cleaned the occurrence data for M. coronaria and M. fusca; however, we still need to do more processing before we can move on to producing the SDM. We must now deal with sampling bias in the occurrences. Sampling bias is an issue because, with presence-only data, it is hard to distinguish whether species are detected in a location because it is truly preferred habitat, or simply because that location has been sufficiently sampled. This issue can be (partially) addressed by thinning records, so that multiple records from the same area are represented by only one (or sometimes a few) occurrence sample(s). This method is imperfect, but it does start to deal with some biases in the spatial representation of the data.
There are several methods for thinning occurrence records, some more sophisticated than others. A simple approach, and the one used in this workflow, is randomly sampling a single occurrence per grid cell of the predictor raster (in this case the climate variables).
# Load libraries
library(tidyverse) # Grammar and data management
library(terra) # Working with spatial data
library(geodata) # Basemaps and climate data
Start by loading the cleaned species occurrence data saved in the .Rdata file we made in the occ_clean script.
setwd("../occ_data/")
occ_cor <- readRDS(file = "occ_cor.Rdata")
When working with spatial data it is important to make sure that different data sources share the same coordinate reference system (a.k.a. projection). We are going to be using climatic data from WorldClim as our predictor raster layers, which is projected using the common WGS84 projection. In order to make sure that the coordinates of occurrences line up accurately within the grid cells of the WorldClim data we are going to create a SpatVector object using terra::vect and specify the crs argument so that it knows to project as WGS84.
occ_cor <- vect(occ_cor, geom = c('decimalLongitude', 'decimalLatitude'),
crs = "+proj=longlat +datum=WGS84")
Now we are going to download the WorldClim predictor rasters. WorldClim is a global data set of bioclimatic variables commonly used in ecological research, particularly in SDM workflows. Variables in this data set represent a comprehensive set of climatic conditions, including temperature seasonality, precipitation in the wettest/driest months, and temperature extremes. See https://www.worldclim.org/data/bioclim.html for a complete list of all 19 bioclimatic variables included.
NOTE: There are three resolutions: 10, 5, and 2.5 arcminutes. I chose 2.5 arcminutes (and generally recommend it) as it is the finest resolution, giving greater accuracy of environmental conditions at the cost of increased computation time. For reference, 2.5 arcminutes is about 3.3 km at 45 degrees latitude. Pick whichever resolution you think is best for your use.
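The 3.3 km figure follows from the geometry of a degree of longitude; as a rough check (assuming a spherical Earth where 1 degree of longitude is ~111.32 km at the equator):

```r
# East-west width of a 2.5 arcminute cell at 45 degrees latitude
res_deg <- 2.5 / 60          # 2.5 arcminutes expressed in degrees
km_per_deg <- 111.32         # approx. km per degree of longitude at the equator
cell_km <- res_deg * km_per_deg * cos(45 * pi / 180) # shrink by cos(latitude)
cell_km # ~3.3 km
```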
IMPORTANT NOTE: These downloads are large (several thousand MB), so if you are working in a GitHub repository, make sure to add these files to the .gitignore file so that copies are not pushed to the repo.
I like to create a separate folder for WorldClim data; it makes the workflow easier to follow and more reproducible. Create a folder at the top level of the project.
setwd("../wclim_data/")
# Note: DO NOT PUSH wclim data
wclim <- worldclim_global(var = 'bio', res = 2.5, version = '2.1', path = "../wclim_data/")
# plot a raster layer to check it downloaded properly
plot(wclim$wc2.1_2.5m_bio_1, main = 'Annual Mean Temperature')
Now that we have vectorized occurrence data and rasterized predictor variables, it is time to thin the occurrence data. We will use the terra::spatSample() function to take a random sample of a single occurrence per raster cell.
set.seed(1337) # set random generator seed to get reproducible results
# M. coronaria thinning
occ_cor <- spatSample(occ_cor, size = 1,
strata = wclim) # sample one occurrence from each climatic cell
And like last time, we will save this thinned data as an .Rdata file for further downstream analysis. NOTE: I give this file a slightly different name, so that I know it contains the thinned points.
setwd("../occ_data/")
saveRDS(occ_cor, file = 'occThin_cor.Rdata')
In the Analysis Overview I mentioned in passing that this is a presence/background SDM workflow. Presence/background data sets are cases in which occurrence data are presence-only (like on GBIF), meaning there is no information about the absence of species. This is an important distinction because, to interpret results appropriately, we need to understand the model assumptions. In the presence/background workflow, background (bg) points are randomly sampled from the environment surrounding and overlapping the known presence points (more discussion on bg sampling below). This effectively samples the distribution of the ‘background’ environment across your study extent. The MaxEnt modeling workflow uses this background data to make inferences about absences, using Bayes’ rule. Thus this workflow produces results similar to those you would get from modeling a presence/absence data set (but if presence/absence data are available, use them!).
See the following for more on background points: Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. Diversity and distributions, 17(1), 43-57. https://doi.org/10.1111/j.1472-4642.2010.00725.x.
So what exactly is the background area, and how can you define a study extent? Should it be simply the convex hull around occurrences? Or should it be all of North America? This depends entirely on the spatial distribution and scale of your data, the model assumptions you are willing to live with, and ecological first principles arising from the life histories of your key species and your research question. NOTE: Any decision you make will create assumptions and introduce spatial biases to the analysis, and therefore affect the interpretation and predictions of your SDMs.
See the following citation for in depth discussion on defining study extent: Barve, N., Barve, V., Jiménez-Valverde, A., Lira-Noriega, A., Maher, S. P., Peterson, A. T., … & Villalobos, F. (2011). The crucial role of the accessible area in ecological niche modeling and species distribution modeling. Ecological modelling, 222(11), 1810-1819. https://doi.org/10.1016/j.ecolmodel.2011.02.011.
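As an illustration of the simplest option mentioned above, a convex hull around the occurrences can be built with terra; a buffered version is sketched below (the 100 km buffer width is an arbitrary assumption for illustration, not a recommendation):

```r
library(terra) # working with vector/raster data

# occThin_cor is the SpatVector of thinned occurrences from occ_thin.R
hull_cor <- convHull(occThin_cor)             # minimal convex polygon around the points
hull_buf <- buffer(hull_cor, width = 100000)  # pad by 100 km (width is in metres for lonlat data)

plot(hull_buf, border = 'blue')
points(occThin_cor, pch = 3, col = 'red')
```

A hull like this assumes the species could access everything inside it, which is one of the assumptions Barve et al. (2011) caution about.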
One approach to defining the study extent, identified by Barve et al. (2011), is limiting the background to biotic regions (hereafter, ecoregions). In short, this method allows a compromise between the real-world biology of species and model tractability. In this workflow I am using North American ecoregions, as defined by an international collaboration (between Canada, Mexico, and the United States) known as the Commission for Environmental Cooperation. Ecoregions are defined at three levels (I, II, and III), or spatial scales, and are openly available at https://www.epa.gov/eco-research/ecoregions-north-america. Choose an appropriate scale for your research question.
In this workflow I opted for the Level II ecoregions, as this level balances the scale at which sets of environmental conditions and biogeography are grouped together while still covering relatively large spatial areas spanning North America.
# Load libraries
library(tidyverse)
library(terra)
library(predicts)
library(geodata)
library(ENMTools)
library(plotly) # 3D surface Kernel bivariate plots
library(MASS)
Due to the large size of the raster files, I have saved a local copy to my machine and will not push these files to the GitHub repo.
ecoNA <- vect(x = "C:/Users/terre/Documents/UBC/Botanical Garden/Malus Project/maps/eco regions/na_cec_eco_l2/", layer = 'NA_CEC_Eco_Level2')
ecoNA <- project(ecoNA, 'WGS84') # project ecoregion vector to same coords ref as basemap
NOTE: I project the ecoregion vector to the same CRS as the climate predictor rasters.
Let’s take a look at the ecoregions (plotted in red), over top of North America.
# download/load maps
getwd()
setwd('../occ_data/')
us_map <- gadm(country = 'USA', level = 1, resolution = 2,
path = "../occ_data/base_maps") #USA basemap w. States
ca_map <- gadm(country = 'CA', level = 1, resolution = 2,
path = '../occ_data/base_maps') #Canada basemap w. Provinces
canUS_map <- rbind(us_map, ca_map) #combine US and Canada vector map
# plot basemap
plot(canUS_map, xlim = c(-180, -50))
# plot ecoregions
lines(ecoNA, col = 'red')
dev.off()
Now let’s load our thinned occurrences from the occ_thin.R script.
# load occurrence data
setwd("../occ_data/")
occThin_cor <- readRDS(file = 'occThin_cor.Rdata')
Now that we have the ecoregions and occurrence points loaded, let’s load the WorldClim predictor rasters. We want to mask and crop these predictor layers to the ecoregions. This will define the study extent, and subsequently allow us to sample the background.
First we need to know which of the Level II ecoregions are known to have species occurrences. To do this we will make use of the terra::extract function. NOTE: The extraction could take a few minutes to complete.
# M. coronaria ecoregions
# extract ecoregion polygon that contain M. coronaria occurrence points
eco_cor <- extract(ecoNA, occThin_cor) # extract what polygons contained points
# return vector of eco region codes of the polygons that contain occurrences
eco_cor_code <- eco_cor$NA_L2CODE %>% unique()
eco_cor_code <- eco_cor_code[eco_cor_code != '0.0'] #remove the 'water' '0.0' ecoregion
# CODES: "8.1" "8.2" "5.3" "8.4" "8.3" "9.4" "8.5" "5.2"
# View the legend on the epa website for the names of these ecoregions.
Now that we have extracted which ecoregion codes contain at least one species occurrence, we can subset the ecoregion vector to only those selected above.
ecoNA_cor <- terra::subset(ecoNA, ecoNA$NA_L2CODE %in% eco_cor_code) # subset ecoregion spat vector by the codes
plot(ecoNA_cor) # plot the subsetted M. coronaria ecoregions
points(occThin_cor, pch = 3, col = 'red') # plot M. coronaria points
And as always we want to save this intermediate object as a .Rdata file, so that we do not need to repeat the long computational step of extracting the ecoregions.
setwd('../occ_data/eco_regions')
saveRDS(ecoNA_cor, file = 'ecoNA_cor.Rdata')
Now we can use these ecoregions to crop our predictor climate rasters. Again, we are doing this so that we can define the study extent and sample the environmental background. Make sure to set the mask argument to TRUE.
# crop+mask extent of WorldClim data to the Malus ecoregions
wclim_cor <- terra::crop(wclim, ecoNA_cor, mask = T)
# Save cropped wclim data for downsteam SDM workflow
saveRDS(wclim_cor, file = 'wclim_cor.Rdata')
Now we want to randomly sample background points, and store the associated predictor values as SpatVectors. Alternatively you could also use SpatRasters if you please, but I found SpatVectors easier to work with.
IMPORTANT NOTE: How many background points should be sampled? There is no set default, although it is common in some studies to see 5000 or 10000 background points. We decided to sample 20000 points due to the very large spatial extent of this study. With this many points we can feel confident that the background environment is adequately sampled. Note that this sampling is random, and has not yet addressed any spatial biases that may arise when modeling backgrounds (this is foreshadowing).
set.seed(1337) # set a seed to ensure reproducible results
# NOTE: Set the argument as.raster = T to return a SpatRaster,
# or as.points = T to return a SpatVector
# M. coronaria background
# SpatVector
# Note: bg points upped from 5000 to 20000 to better reflect a mean probability of presence of 1 - a/2
cor_bg_vec <- spatSample(wclim_cor, 20000, 'random', na.rm = T, as.points = T) #ignore NA values
plot(wclim_cor[[1]])
points(cor_bg_vec, cex = 0.01)
We can also measure how many km^2 these ecoregions comprise, to get an idea of how many samples are taken per km^2.
expanse(wclim_cor[[1]], unit = 'km') # total area of raster in km^2
# 5683684 km^2
20000/5683684
# ~0.0035 samples/km^2
Annnnd you guessed it, we want to save these SpatVectors of background points for the downstream workflow.
# Save background SpatVectors
setwd("../occ_data/")
saveRDS(cor_bg_vec, file = 'cor_bg_vec.Rdata')
You may also want to extract the actual predictor values into a data.frame for both the occurrence points and background points. To do this you can again make use of the terra::extract function as above, and the terra::values function to pull the values from the background SpatVectors created earlier. I show both, as they can be useful, but I mainly use the SpatVectors downstream.
# Extracting presence-background raster values
cor_predvals <- extract(wclim_cor, occThin_cor) # M. coronaria
cor_predvals <- cor_predvals[-1] # drop ID column
cor_bgvals <- values(cor_bg_vec) # Extract raster values for bg points
Now we can create and save a data.frame which may come in handy later.
cor_pb <- c(rep(1, nrow(cor_predvals)), rep(0, nrow(cor_bgvals))) #binary (0,1) presence or background string
# combine presence and background data frames for SDM
cor_sdmData <- data.frame(cbind(cor_pb, rbind(cor_predvals, cor_bgvals)))
setwd("../occ_data/")
saveRDS(cor_sdmData, file = 'cor_sdmData.Rdata')
So we have now successfully defined a study extent using ecoregions that contain occurrences, and cropped the predictor raster so that background points could be sampled from it. Then, using these background and occurrence points, we extracted the bioclimatic values from the predictor raster.
Using these values in a data frame (e.g. cor_sdmData) we can run several collinearity checks. These checks allow us to understand how the climatic variables relate to one another. It is important to note that, in a classical statistical approach, it is considered good practice to limit collinear variables to prevent overfitting of a model; this step is referred to as variable selection. That said, MaxEnt is not a linear modeling approach, and thus does not need to satisfy assumptions about predictor collinearity. On top of that, MaxEnt is known to be robust to including many predictors.
See Elith et al. (2011), cited above, for more on MaxEnt and predictor selection.
We can start by simply plotting the relationships between all predictor pairs using base R graphics and the pairs function. This gets messy quickly with so many variables, so it is a bit hard to read. It may take a minute to plot…
# M. coronaria
pairs(cor_sdmData[,-1]) # drop the first column of 0/1
Let’s use a more helpful way to detect collinearity of predictor variables: a dendrogram, which visualizes the correlation matrix. NOTE: This only considers two-dimensional covariation, and cannot detect higher-dimensional correlations.
# Dendrogram cluster of predictor collinearity
threshold <- 0.7 # set the threshold for collinearity
cor_cors <- raster.cor.matrix(wclim_cor) # Pearson correlation
cor_dist <- as.dist(1 - abs(cor_cors)) # convert correlations to distances
cor_clust <- hclust(cor_dist, method = 'single') # calculate cluster dendrogram
cor_groups <- cutree(cor_clust, h = 1 - threshold) # calculate groupings of variables
plot(cor_clust, hang = -1, main = "M. coronaria Predictors")
rect.hclust(cor_clust, h = 1 - threshold)
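If you do want to thin the predictor set (e.g. for interpretability), the cluster groupings themselves can be used, for example by keeping one variable per group. A minimal base-R sketch, assuming the cor_groups object from above; this step is optional and not part of the MaxEnt workflow:

```r
# Hypothetical variable selection: keep the first variable from each correlation cluster
var_groups <- split(names(cor_groups), cor_groups) # variables grouped by cluster
keep_vars <- vapply(var_groups, function(g) g[1], character(1))
keep_vars # one representative predictor per cluster
```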
We can also visualize the predictor covariate values to understand how they inform the model about the probability of a true presence. This involves visualizing the distribution of values at the sampled presence and background points, which gives us a peek at how the model's constraints come about.
To do this we can compare the density of covariates from occurrence points and background points. I like to visualize this density using Kernel Density Plots. This plot allows you to visualize three dimensions, in this case two of the bioclimatic predictor variables, and the density of points sampled (pseudo-probability). We will build two separate plots to visualize the density of occurrence points and of background points.
First we will create new data frames for each of the climatic variables (in this case mean annual temperature, and mean annual precipitation), so that we can plot them below.
# Predictor Kernel Density Plots
# It is helpful to visualize predictor pairs for
# presence points and background points
cor_occ.temp <- cor_sdmData %>% filter(cor_pb == 1) %>% # Presence points
dplyr::select(wc2.1_2.5m_bio_1) %>% # Mean annual temp
drop_na() %>%
unlist()
cor_bg.temp <- cor_sdmData %>% filter(cor_pb == 0) %>% # Background points
dplyr::select(wc2.1_2.5m_bio_1) %>% # Mean annual temp
drop_na() %>%
unlist()
cor_occ.perc <- cor_sdmData %>% filter(cor_pb == 1) %>% # Presence points
dplyr::select(wc2.1_2.5m_bio_12) %>% # Annual precipitation
drop_na() %>%
unlist()
cor_bg.perc <- cor_sdmData %>% filter(cor_pb == 0) %>% # Background points
dplyr::select(wc2.1_2.5m_bio_12) %>% # Annual precipitation
drop_na() %>%
unlist()
Now that all the data is prepared, we will build the bivariate kernel density grid using the function MASS::kde2d. This function takes the x coordinate (temperature values) and the y coordinate (precipitation values), and returns z, the density of points (pseudo-probability).
cor_occ.3d <- kde2d(cor_occ.temp, cor_occ.perc)
cor_bg.3d <- kde2d(cor_bg.temp, cor_bg.perc)
And plot the results using the plotly package, which builds interactive 3D plots; in this case we use it to visualize the kernel density surfaces.
plot_cor.occ_3d <- plot_ly(x=cor_occ.3d$x, y=cor_occ.3d$y, z=cor_occ.3d$z) %>%
add_surface() %>%
layout(scene = list(xaxis = list(title = 'Mean Annual Temp (C)', autotick = F, nticks = 5, tickvals = list(0,5,10,15,20)),
yaxis = list(title = 'Mean Annual Precip. (mm)', tick0=0, tick1=2000, dtick=200),
zaxis = list(title = 'Kernel Density', tick0=0, tick1=0.001, dtick=0.0002)),
title = list(text = "<i>M. coronaria</i> Occurrence Points",
y = 0.95, x = 0.5,
xanchor = 'center',
yanchor = 'top'))
plot_cor.occ_3d # run to view
plot_cor.bg_3d <- plot_ly(x=cor_bg.3d$x, y=cor_bg.3d$y, z=cor_bg.3d$z) %>%
add_surface() %>%
layout(scene = list(xaxis = list(title = 'Mean Annual Temp (C)', tick0=0, tick1=20, dtick=5),
yaxis = list(title = 'Mean Annual Precip. (mm)', tick0=0, tick1=2000, dtick=200),
zaxis = list(title = 'Kernel Density')),
title = list(text = "<i>M. coronaria</i> Background Points",
y = 0.95, x = 0.5,
xanchor = 'center',
yanchor = 'top'))
plot_cor.bg_3d
In this example I only visualize two of the 19 bioclimatic variables; you may choose to visualize more for your own purposes. Remember this is only a visual comparison, but it is valuable for understanding what the model is 'seeing'; of course the model uses the multidimensional space of all covariates, not just the two shown here.
So now we (almost) have all we need to finally build the SDMs. We have the presence (occurrence) points and the background points, as well as the climatic predictor rasters (WorldClim bioclimatic variables). What we don’t have is the future bioclimatic data, which we will use with the fitted SDM to make predictions about future suitable habitat under different climatic scenarios.
library(tidyverse) # Grammar and data management
library(terra) # Spatial data package
library(predicts) # SDM package
library(geodata) # basemaps
library(rJava) # MaxEnt models are dependent on the JDK
library(ENMeval) # Another modeling package, useful for data partitioning (checkerboarding)
library(raster) # RasterStack dependency (a now-deprecated package)
library(ecospat) # Useful spatial ecology tools
library(parallel) # speed up computation by running in parallel
library(doParallel) # added functionality for parallel
Let’s start by loading in the thinned occurrence points and background points, along with some base maps of Canada, USA, and Mexico.
# Download/load basemaps
us_map <- gadm(country = 'USA', level = 1, resolution = 2,
path = "../occ_data/base_maps") #USA basemap w. States
ca_map <- gadm(country = 'CA', level = 1, resolution = 2,
path = '../occ_data/base_maps') # Canada basemap w. Provinces
mex_map <- gadm(country = 'MX', level = 1, resolution = 2,
path = '../occ_data/base_maps') # Mexico basemap w. States
canUSMex_map <- rbind(us_map, ca_map, mex_map) # Combine Mexico, US and Canada vector map
NA_ext <- ext(-180, -30, 18, 85) # set the spatial extent of the analysis to North America in the Western Hemisphere
canUSMex_map <- crop(canUSMex_map, NA_ext) # crop to Western Hemisphere
plot(canUSMex_map) # plot basemap
Now let's download the future bioclimatic data. For this we will need to specify a few additional arguments compared to when we downloaded the historical climate data. We will again use the geodata package, but instead of worldclim_global we will use cmip6_world.
This function will download data for projected future climates from Coupled Model Intercomparison Project Phase 6 (CMIP6) generated from Global Climate Models (GCMs). CMIP6 projections are based on the Shared Socio-economic Pathway (SSP) scenarios, included in the Intergovernmental Panel on Climate Change (IPCC) Assessment Report (AR6).
There are six different SSPs, four of which are available through geodata: "126", "245", "370" and "585", for SSP1-2.6, SSP2-4.5, SSP3-7.0 and SSP5-8.5. Data are also available for three time periods: "2021-2040", "2041-2060" and "2061-2080", i.e. early century (2030), mid century (2050) and late century (2070). In this workflow I model all three timeframes, and I chose two SSPs, SSP2-4.5 and SSP5-8.5. The former represents a 'middle of the road' scenario with high climate adaptation and low climate mitigation; the latter is a 'worst case' or 'business as usual' scenario in which global development continues down an unsustainable route with increased fossil fuel reliance. It is worth noting that SSP5-8.5 is the current tracking scenario: https://news.stanford.edu/2023/01/30/ai-predicts-global-warming-will-exceed-1-5-degrees-2030s/.
The final decision to make is which CMIP6 climate model to choose. There are several models available through geodata, each with its individual strengths and weaknesses. For the purposes of this study, as I am primarily concerned with Malus habitat in Canada, I chose the Canadian Earth System Model version 5 (CanESM5). This is still a global model, but it is the one most commonly employed for climate research in Canada. Read more here: https://gmd.copernicus.org/articles/12/4823/2019/gmd-12-4823-2019.html
NOTE: Climate projections are just that: projections. These models are useful, but can be flawed for a host of reasons. Recently it has come to the attention of climate modelers that many of the models included in CMIP6 are 'too hot', meaning their rates of warming may be too high, although this is not true for every region. Read more here: https://www.science.org/content/article/use-too-hot-climate-models-exaggerates-impacts-global-warming.
# SSP (Shared Socio-economic Pathway) 2-4.5
# 'middle of the road' projection: high climate adaptation, low climate mitigation
ssp245_2030 <- cmip6_world(model = "CanESM5",
ssp = "245",
time = "2021-2040",
var = "bioc",
res = 2.5,
path = "../wclim_data/")
ssp245_2050 <- cmip6_world(model = "CanESM5",
ssp = "245",
time = "2041-2060",
var = "bioc",
res = 2.5,
path = "../wclim_data/")
ssp245_2070 <- cmip6_world(model = "CanESM5",
ssp = "245",
time = "2061-2080",
var = "bioc",
res = 2.5,
path = "../wclim_data/")
# SSP 5-8.5
# low regard for environmental sustainability, increased fossil fuel reliance; this is the current tracking projection
ssp585_2030 <- cmip6_world(model = "CanESM5",
ssp = "585",
time = "2021-2040",
var = "bioc",
res = 2.5,
path = "../wclim_data/")
ssp585_2050 <- cmip6_world(model = "CanESM5",
ssp = "585",
time = "2041-2060",
var = "bioc",
res = 2.5,
path = "../wclim_data/")
ssp585_2070 <- cmip6_world(model = "CanESM5",
ssp = "585",
time = "2061-2080",
var = "bioc",
res = 2.5,
path = "../wclim_data/")
NOTE: Just like with the previous climate data, DO NOT push these large files to a GitHub repo due to their size.
Now that we have the future climate data we need to do a bit more prep work to get ready for the SDMs. First we should load in the cropped climate rasters from the background sampling script. I am doing this so I can visualize a spatial sampling method called checkerboarding (more details to come below).
# These Rasters are useful for sampling spatial checkerboards
# and making habitat suitability predictions (Historical and under future SSPs climate scenarios)
setwd('../wclim_data')
# Historical (1970-2000)
wclim_cor <- readRDS(file = 'wclim_cor.Rdata')
wclim_cor_stack <- raster::stack(wclim_cor) # convert SpatRaster to RasterStack for the ENMeval checkerboarding dependency
wclim_fus <- readRDS(file = 'wclim_fus.Rdata')
wclim_fus_stack <- raster::stack(wclim_fus) # convert SpatRaster to RasterStack for the ENMeval checkerboarding dependency
NOTE: The stack function is from the deprecated raster package, so it may not work with future versions of ENMeval (the package used for checkerboarding and SDM). The closest equivalent in the more up-to-date terra package is c(), which combines SpatRasters, but it does not produce the RasterStack class that ENMeval requires.
Then, for ease of making predictions downstream, we will extract the names of each bioclimatic raster layer and use them to rename the layers of the projected climate rasters, so that the names correspond when our SDM makes predictions.
climate_predictors <- names(wclim_cor) # extract climate predictor names, to rename layers in the rasters below
# This is important to do for making predictions once the SDMs have been made on future climate data
# Note that the names of the layers still correspond to the same environmental variables
# Future SSPs
# Do not need to create RasterStacks
# SSP 245
names(ssp245_2030) <- climate_predictors # rename raster layers for downstream analysis
names(ssp245_2050) <- climate_predictors
names(ssp245_2070) <- climate_predictors
# SSP 585
names(ssp585_2030) <- climate_predictors # rename raster layers for downstream analysis
names(ssp585_2050) <- climate_predictors
names(ssp585_2070) <- climate_predictors
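Since the downstream predictions silently depend on matching layer names, a quick sanity check (my suggestion, not part of the original workflow) can confirm the renaming worked:

```r
# Confirm the future rasters now share layer names with the historical predictors
stopifnot(identical(names(ssp245_2030), names(wclim_cor)))
stopifnot(identical(names(ssp585_2070), names(wclim_cor)))
```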
So now we have all our data prepped and it's time to start getting things sorted for the SDM. In the next few pieces of code I will show you how to manually partition data into training and testing groups. The most important reason for doing this is evaluating the performance and generalization of the model. This is called cross-validation: the species occurrence data is partitioned into training data that is input to the model, and testing data that is withheld and used to evaluate how well the trained model fits the 'new' testing occurrences. There are several different ways to partition data, each with pros and cons.
The simplest method is random k-fold, where data is equally binned into groups ('folds'). However, this can be flawed, as it can preserve spatial bias in the training and testing data. It is preferred to use spatial partitioning, which reduces spatial autocorrelation and preserves spatial heterogeneity, helping to reduce model overfitting. This creates partitions that are much more ecologically relevant, so the model can learn how the species is spatially organized, which in turn makes its predictions more generalizable to future climates.
There are several different methods to spatially partition occurrence data. A common approach is to overlay a square grid and take a checkerboard pattern of occurrences, where occurrences are partitioned depending on which square they fall in. The method I will demonstrate completes 4-fold cross-validation, where four partitions are taken: three are used for training and one for testing. As mentioned, I will show you how to do this manually, but the MaxEnt function we use later will do this partitioning for us, as well as cycle through and withhold each group.
Read more about spatial partitioning and checkerboarding in the following two articles.
Radosavljevic, A., & Anderson, R. P. (2014). Making better Maxent models of species distributions: complexity, overfitting and evaluation. Journal of biogeography, 41(4), 629-643. https://doi.org/10.1111/jbi.12227.
Muscarella, R., Galante, P. J., Soley‐Guardia, M., Boria, R. A., Kass, J. M., Uriarte, M., & Anderson, R. P. (2014). ENM eval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models. Methods in ecology and evolution, 5(11), 1198-1205. https://doi.org/10.1111/2041-210X.12261.
Let’s get started with the partitions.
# Spatial partitioning preparation
occ_cor_coords <- as.data.frame(geom(occThin_cor)[,3:4]) # extract longitude, latitude from occurrence points
bg_cor_coords <- as.data.frame(geom(cor_bg_vec)[,3:4]) # extract longitude, latitude from background points
occ_fus_coords <- as.data.frame(geom(occThin_fus)[,3:4]) # extract longitude, latitude from occurrence points
bg_fus_coords <- as.data.frame(geom(fus_bg_vec)[,3:4]) # extract longitude, latitude from background points
The geom function will get the geometry of the occurrences from the SpatVectors and store them in data frames.
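For contrast with the spatial method, the simple random k-fold partition described earlier can be sketched in base R as follows; this is for illustration only, and these folds are not used anywhere in the workflow:

```r
# Random 4-fold partition (non-spatial) of the occurrence coordinates
set.seed(1337)
k <- 4
rand_folds <- sample(rep(1:k, length.out = nrow(occ_cor_coords)))
table(rand_folds) # roughly equal fold sizes, but spatial bias is preserved
```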
# Spatial Checkerboard partitioning of occurrence and background points cross-fold validation
set.seed(1337)
agg_factor <- c(9,9) # defines how many adjacent cells are aggregated into the same checkerboard 'square'
# I visually tested different aggregations and found 9,9 was the best fit given the spatial extent
Setting a random-generator seed ensures that randomized partitions yield the same results consistently. The aggregation (agg) factor determines the scale at which occurrences are aggregated in each grid cell; the first value specifies the fine-scale aggregation and the second the coarse scale.
NOTE: We are using the get.checkerboard2 function to get the 4-fold cross validation. This function is convenient but there is no reason why you could not do more partitions if you wanted to.
There are four required arguments: 1. the species occurrences coordinates, 2. background point coordinates, 3. a predictor RasterStack that is cropped to the extent of the background, and 4. the above defined aggregation factor.
Then we can use the evalplot.grps function to visualize the partitions.
cor_cb <- get.checkerboard2(occs = occ_cor_coords,
bg = bg_cor_coords,
envs = wclim_cor_stack,
aggregation.factor = agg_factor
)
evalplot.grps(pts = occ_cor_coords, pts.grp = cor_cb$occs.grp, envs = wclim_cor_stack, pts.size = .75) # plot the checkerboard partitions of occurrence points
evalplot.grps(pts = bg_cor_coords, pts.grp = cor_cb$bg.grp, envs = wclim_cor_stack, pts.size = .75) # plot the checkerboard partitions of background points
# Plotting the background points may return an error, because the aggregation
# factor can drop some observations during checkerboard sampling.
# This is fine, as we do not need to visualize the groups; if you want to,
# you must first reconcile the difference in observation counts.
First let's look at the species occurrence partitions; with relatively few points, the checkerboard pattern is hard to see.
Now let's look at the background points, where the checkerboarding is much clearer. Below I also show how adjusting the aggregation factor changes the scale at which the partitions are taken.
3,3 aggregation
6,6 aggregation
9,9 aggregation, which I chose to use!
12,12 aggregation
Okay, now that we understand how the partitioning works, let's dive into developing our MaxEnt model. We are going to use parallel processing to utilize multiple processors (CPUs) simultaneously; this increases the available memory and significantly decreases computation time. If you are using a Windows computer (like me), then you only have one type of parallel processing available to you, known as 'socket' clusters. Read more here: https://www.r-bloggers.com/2019/06/parallel-r-socket-or-fork/#:~:text=As%20a%20result%2C%20it%20runs,40%25%20faster%20than%20the%20socket.
First let’s take a look at the code chunk and then I will break each argument down.
# Build Species Distribution Model using MaxEnt from the <ENMeval> package
# Run prediction in a parallel using 'socket' clusters to help speed up computation
# <ENMeval> implements parallel functions natively
# But load <parallel> library for additional functions like <detectCores()>
cn <- detectCores(logical = F) # logical = F returns the number of physical CPU cores in your computer
set.seed(1337)
cor_maxent <- ENMevaluate(occ = occ_cor_coords, # occurrence records
envs = wclim_cor, # environment from background training area
n.bg = 20000, # 20000 bg points
tune.args =
list(rm = seq(0.5, 8, 0.5),
fc = c("L", "LQ", "H",
"LQH", "LQHP", "LQHPT")),
partition.settings =
list(aggregation.factor = c(9, 9), gridSampleN = 20000), # 9,9 agg
partitions = 'checkerboard2',
parallel = TRUE,
numCores = cn - 1, # leave one core available for other apps
parallelType = "doParallel", # use doParallel on Windows - socket cluster
algorithm = 'maxent.jar')
NOTE: The cn object is the number of physical CPUs your computer has. The function detectCores requires that you load the parallel library. Also, make sure to set the random generator seed to achieve consistent results.
Breakdown of the ENMevaluate function
The first argument, occ, expects a data.frame of all species occurrences with two columns, longitude and latitude, in that order.
Like in the example we did above, we want to set n.bg = 20,000 background points to get a comprehensive sample.
Next, the envs argument takes a SpatRaster of bioclimatic predictor data cropped to the background extent (or ecoregions). We can reuse the cropped raster produced in the malus_bg.R script.
Now we can adjust, or tune, the arguments of the MaxEnt algorithm with tune.args.
rm stands for "regularization multiplier", which affects how focused or closely fitted the output distribution is. Smaller values can lead to model overfitting, so it is good to test a range of values.
fc stands for "feature class". This corresponds to a mathematical transformation of the different covariates, allowing complex relationships to be modeled: L = linear, Q = quadratic, H = hinge, P = product and T = threshold.
partition.settings is where we control the aggregation factor we played with above. Also make sure to set the number of grid cells included in the partitioning (gridSampleN) equal to the number of background points sampled; it is 10,000 by default.
partitions selects the cross-validation method. As above, we use 4-fold checkerboard partitioning. The benefit of doing this inside ENMevaluate is that the model will partition and then cross-validate across all groupings, rotating which data are used for training and which for testing. It then produces a continuous Boyce Index (cbi) correlating each set of testing data against the model fits.
parallel should be set to TRUE to enable parallel processing. If it is not enabled you will likely get a runtime or memory error, especially with complex models.
numCores specifies how many cores RStudio can access simultaneously. It is good practice to leave one core free to run other processes on your computer.
parallelType set to "doParallel" runs a socket cluster, the only type available on Windows. The other option, "doSNOW", runs a fork cluster, which can decrease computation time, but is only available on Mac and Linux.
algorithm selects which SDM implementation to use. Here I use maxent.jar, a version of MaxEnt implemented in Java and called through the dismo package.
Another version of MaxEnt is also available, implemented in native R via the maxnet package. It is worth noting that there are subtle differences between the two. Personally I find the memory usage of maxnet to be much greater than maxent.jar, which means longer computation time. The benefit of maxnet is that you do not have to install the JDK, which can be a challenge.
NOTE: The above model took 39 minutes and 32.8 seconds to compute. My laptop has a 3.2 GHz AMD Ryzen 7 5800H processor with eight cores. Depending on your PC specs, model complexity (including the number of tuning combinations), and study extent, your modeling may take more or less time.
And just like all the previous long computations we want to save a .Rdata file to reload the model if needed.
# Save the MaxEnt model so you do not have to waste time re-running the model
setwd('../sdm_output')
saveRDS(cor_maxent, file = 'cor_maxent.Rdata') # save
cor_maxent <- readRDS(file = 'cor_maxent.Rdata') # load
The next step, after running several model combinations (96 to be precise), is to select the 'best' model (i.e. a particular set of tuning arguments). A common approach is to use the Akaike information criterion (AIC), specifically delta AICc, a corrected AIC expressed as the relative difference in AICc between models. Models with delta AICc less than 2 are considered equally good. In my case only a single model scored below 2: the best performing model. Typically, ties between well-performing models are broken by choosing the least complex model (the simplest fc) and/or the one with the lowest rm value.
# Note that maxent results provide Continuous Boyce Index (cbi)
best_cor_maxent <- subset(cor_maxent@results, delta.AICc == 0) # selects the best performing model based on delta AICc - returns data frame object
mod.best_cor_maxent <- eval.models(cor_maxent)[[best_cor_maxent$tune.args]] # extracts the best model - returns MaxEnt object
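It can also be helpful to eyeball the top of the tuning results table before committing to a model. A small sketch, using only the tune.args and delta.AICc columns referenced above (other column names may differ between ENMeval versions):

```r
# Inspect the tuning results, ordered by delta AICc (best models first)
res <- cor_maxent@results
head(res[order(res$delta.AICc), c('tune.args', 'delta.AICc')])
```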
So now that we have selected our best performing MaxEnt model, we can start to make some predictions! We will use the terra package to create a SpatRaster of habitat suitability scores. First, let's make predictions using the historical WorldClim data. In this case we want the uncropped WorldClim SpatRaster, to make predictions across North America; why will become clear once we get to the future climate change predictions.
# Wclim is the historical climatic conditions (1970-2000)
cor_pred_hist <- terra::predict(wclim, mod.best_cor_maxent, cores = cn - 1)
plot(cor_pred_hist)
points(occThin_cor, cex = 0.05)
We have made our first prediction of the fundamental niche of M. coronaria! But there are a few more things to consider before moving on.
First, how confident can we be in these predictions? Metrics like the Area Under the Curve (AUC) and the Continuous Boyce Index can help, but they have their own flaws, so it is important to think critically about the results. Note that ENMevaluate and the MaxEnt algorithm automatically compute several indices, including separate values calculated on the training data and on the validation (testing) data.
ecospat.boyce uses a moving window, forming classes from 1/10th of the suitability range. The Boyce Index is then calculated as a Spearman rank correlation between the predicted-to-expected (P/E) ratio and the habitat suitability classes. What we expect to see is a steady increase in P/E with increasing suitability scores.
# Evaluate predictions using Boyce Index
# the ratio of predicted to expected presences should increase with habitat suitability
# First extract suitability values for the background and presence points, make sure to omit NA values
corPred_bg_val <- terra::extract(cor_pred_hist, bg_cor_coords)$lyr1 %>%
na.omit()
corPred_val_na <- terra::extract(cor_pred_hist, occ_cor_coords)$lyr1 %>%
na.omit()
# Evaluate predictions using Boyce Index
ecospat.boyce(fit = corPred_bg_val, # vector of predicted habitat suitability of bg points
obs = corPred_val_na, # vector of predicted habitat suitability at presence points
nclass = 0,
PEplot = TRUE,
method = 'spearman')
Second, is it best to leave these suitability scores as a continuous gradient? Myself (and many others) find maps with continuous colour gradients illegible, or less informative at a glance. Some take the approach of creating a binary threshold, where habitat suitability becomes Boolean (suitable/unsuitable). There are several methods to determine such a threshold, available in the predicts package. A flaw of this approach is that it is less tangible what these thresholds mean in lay terms.
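Note that pa_evaluate below expects the *_mat objects, which do not appear earlier in the script; they are assumed to be matrix versions of the coordinate data frames, e.g.:

```r
# Assumed prep step: convert the coordinate data frames to matrices for pa_evaluate
occ_cor_coords_mat <- as.matrix(occ_cor_coords)
bg_cor_coords_mat <- as.matrix(bg_cor_coords)
```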
cor_pa <- predicts::pa_evaluate(p = occ_cor_coords_mat, a = bg_cor_coords_mat, model = cor_maxent, x = wclim_cor)
cor_threshold <- predicts::threshold(cor_pa)
cor_hist_habitat <- cor_pred_hist > cor_threshold$max_spec_sens #the threshold at which the sum of the sensitivity (true positive rate) and specificity (true negative rate) is highest
However, I think it is more practical to bin suitability scores into several categories; let's say three: "High suitability", "Moderate suitability" and "Low suitability". These categories are determined using quantiles, where suitability scores in the lowest percentiles are considered low or moderately suitable, and higher scores highly suitable. Here I use the 1st, 10th, and 50th percentiles to classify the categories. This means that with the low-suitability threshold we omit only 1% of predicted suitable habitat, compared to the 50th percentile, where we omit 50% of predicted values. This approach also has flaws, but the 10th percentile is typically considered an acceptable minimum threshold.
corPred_val <- terra::extract(cor_pred_hist, occ_cor_coords)$lyr1
corPred_threshold_1 <- quantile(corPred_val, 0.01, na.rm = T) # Low suitability
corPred_threshold_10 <- quantile(corPred_val, 0.1, na.rm = T) # Moderate suitability
corPred_threshold_50 <- quantile(corPred_val, 0.5, na.rm = T) # High suitability
Now let’s visualize the results! As mentioned previously, we will use ggplot2 downstream to make publication quality plots. But, this is a great start!
legend_labs <- c('Low Suitability (1st percentile)', 'Moderate Suitability (10th percentile)', 'High Suitability (50th percentile)')
fill_cols <- c('#D81B60', '#1E88E5', '#FFC107') # colour blind friendly
par(mar = c(5, 5, 5, 5))
terra::plot(cor_pred_hist > corPred_threshold_1, col = c('lightgrey', '#D81B60'), legend = F, xlim = c(-100, -60), ylim = c(25, 50), main = 'Historical (1970-2000)')
terra::plot(cor_pred_hist > corPred_threshold_10, col = c(NA, '#1E88E5'), add = T, legend = F)
terra::plot(cor_pred_hist > corPred_threshold_50, col = c(NA, '#FFC107'), add = T, legend = F)
terra::plot(canUSMex_map, add = T)
points(occThin_cor, col = 'black', cex = 0.75, pch = 4)
legend(x = -75, y = 30, xpd = NA, inset = c(5, 0),
title = 'Habitat Suitability',
legend = legend_labs,
fill = fill_cols)
Now we can move on to make predictions for the future climate scenarios. We will still use the same threshold values from the historical prediction to keep interpretation consistent. As always, make sure to save the predictions to save time when having to reload.
# Future Climate predictions
# SSP 245
cor_pred_ssp245_30 <- terra::predict(ssp245_2030, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp245_50 <- terra::predict(ssp245_2050, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp245_70 <- terra::predict(ssp245_2070, mod.best_cor_maxent, cores = cn - 1)
# SSP585
cor_pred_ssp585_30 <- terra::predict(ssp585_2030, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp585_50 <- terra::predict(ssp585_2050, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp585_70 <- terra::predict(ssp585_2070, mod.best_cor_maxent, cores = cn - 1)
# Save/Load M cor. SDM predictions ----------------------------------------
setwd('../sdm_output')
saveRDS(cor_pred_hist, file = 'cor_pred_hist.Rdata')
saveRDS(cor_pred_ssp245_30, file = 'cor_pred_ssp245_30.Rdata')
saveRDS(cor_pred_ssp245_50, file = 'cor_pred_ssp245_50.Rdata')
saveRDS(cor_pred_ssp245_70, file = 'cor_pred_ssp245_70.Rdata')
saveRDS(cor_pred_ssp585_30, file = 'cor_pred_ssp585_30.Rdata')
saveRDS(cor_pred_ssp585_50, file = 'cor_pred_ssp585_50.Rdata')
saveRDS(cor_pred_ssp585_70, file = 'cor_pred_ssp585_70.Rdata')
For the purposes of this tutorial I will only show results for one of the future climate predictions: SSP5-8.5 2021-2040 (the current tracking projection), reusing the same code as the historical suitability plot above.
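A sketch of reusing the historical plotting code for the SSP5-8.5 2021-2040 prediction, assuming the thresholds, basemap, and occurrence objects created above:

```r
# Same thresholded plotting approach as the historical map
par(mar = c(5, 5, 5, 5))
terra::plot(cor_pred_ssp585_30 > corPred_threshold_1, col = c('lightgrey', '#D81B60'),
            legend = F, xlim = c(-100, -60), ylim = c(25, 50), main = 'SSP5-8.5 (2021-2040)')
terra::plot(cor_pred_ssp585_30 > corPred_threshold_10, col = c(NA, '#1E88E5'), add = T, legend = F)
terra::plot(cor_pred_ssp585_30 > corPred_threshold_50, col = c(NA, '#FFC107'), add = T, legend = F)
terra::plot(canUSMex_map, add = T)
points(occThin_cor, col = 'black', cex = 0.75, pch = 4)
```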
We can see a clear northward shift in the fundamental niche of M. coronaria. At the same time, the climate for M. coronaria appears to remain favourable in Southern Ontario, at least in the near future. Also pay attention to areas where fundamental habitat is being lost. Remember that these are simply predictions, carrying many assumptions.